Land cover classification at the Mississppi Delta

In this notebook, you will use a k-means unsupervised clustering algorithm to group pixels by similar spectral signatures. k-means is an exploratory method for finding patterns in data. Because it is unsupervised, you don’t need any training data for the model. You also can’t measure how well it “performs” because the clusters will not correspond to any particular land cover class. However, we expect at least some of the clusters to be identifiable as different types of land cover.

You will use the harmonized Sentinal/Landsat multispectral dataset. You can access the data with an Earthdata account and the earthaccess library from NSIDC:

STEP 1: SET UP

Try It
  1. Import all libraries you will need for this analysis
  2. Configure GDAL parameters to help avoid connection errors: python os.environ["GDAL_HTTP_MAX_RETRY"] = "5" os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"
See our solution!
import os
import pickle
import re
import warnings

import cartopy.crs as ccrs
import earthaccess
import earthpy as et
import geopandas as gpd
import geoviews as gv
import hvplot.pandas
import hvplot.xarray
import numpy as np
import pandas as pd
import rioxarray as rxr
import rioxarray.merge as rxrmerge
from tqdm.notebook import tqdm
import xarray as xr
from shapely.geometry import Polygon
from sklearn.cluster import KMeans

os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"

warnings.simplefilter('ignore')

Below you can find code for a caching decorator which you can use in your code. To use the decorator:

@cached(key, override)
def do_something(*args, **kwargs):
    ...
    return item_to_cache

This decorator will pickle the results of running the do_something() function, and only run the code if the results don’t already exist. To override the caching, for example temporarily after making changes to your code, set override=True. Note that to use the caching decorator, you must write your own function to perform each task!

def cached(func_key, override=False):
    """
    A decorator to cache function results
    
    Parameters
    ==========
    key: str
      File basename used to save pickled results
    override: bool
      When True, re-compute even if the results are already stored
    """
    def compute_and_cache_decorator(compute_function):
        """
        Wrap the caching function
        
        Parameters
        ==========
        compute_function: function
          The function to run and cache results
        """
        def compute_and_cache(*args, **kwargs):
            """
            Perform a computation and cache, or load cached result.
            
            Parameters
            ==========
            args
              Positional arguments for the compute function
            kwargs
              Keyword arguments for the compute function
            """
            # Add an identifier from the particular function call
            if 'cache_key' in kwargs:
                key = '_'.join((func_key, kwargs['cache_key']))
            else:
                key = func_key

            path = os.path.join(
                et.io.HOME, et.io.DATA_NAME, 'jars', f'{key}.pickle')
            
            # Check if the cache exists already or override caching
            if not os.path.exists(path) or override:
                # Make jars directory if needed
                os.makedirs(os.path.dirname(path), exist_ok=True)
                
                # Run the compute function as the user did
                result = compute_function(*args, **kwargs)
                
                # Pickle the object
                with open(path, 'wb') as file:
                    pickle.dump(result, file)
            else:
                # Unpickle the object
                with open(path, 'rb') as file:
                    result = pickle.load(file)
                    
            return result
        
        return compute_and_cache
    
    return compute_and_cache_decorator

STEP 2: STUDY SITE

For this analysis, you will use a watershed from the Water Boundary Dataset, HU12 watersheds (WBDHU12.shp).

Try It
  1. Download the Water Boundary Dataset for region 8 (Mississippi)
  2. Select watershed 080902030506
  3. Generate a site map of the watershed

Try to use the caching decorator

See our solution!
@cached('wbd_08')
def read_wbd_file(wbd_filename, huc_level, cache_key):
    # Download and unzip
    wbd_url = (
        "https://prd-tnm.s3.amazonaws.com"
        "/StagedProducts/Hydrography/WBD/HU2/Shape/"
        f"{wbd_filename}.zip")
    wbd_dir = et.data.get_data(url=wbd_url)
                  
    # Read desired data
    wbd_path = os.path.join(wbd_dir, 'Shape', f'WBDHU{huc_level}.shp')
    wbd_gdf = gpd.read_file(wbd_path, engine='pyogrio')
    return wbd_gdf

huc_level = 12
wbd_gdf = read_wbd_file(
    "WBD_08_HU2_Shape", huc_level, cache_key=f'hu{huc_level}')

delta_gdf = (
    wbd_gdf[wbd_gdf[f'huc{huc_level}']
    .isin(['080902030506'])]
    .dissolve()
)

(
    delta_gdf.to_crs(ccrs.Mercator())
    .hvplot(
        alpha=.2, fill_color='white', 
        tiles='EsriImagery', crs=ccrs.Mercator())
    .opts(width=600, height=300)
)
Downloading from https://prd-tnm.s3.amazonaws.com/StagedProducts/Hydrography/WBD/HU2/Shape/WBD_08_HU2_Shape.zip
Extracted output to /home/runner/earth-analytics/data/earthpy-downloads/WBD_08_HU2_Shape

We chose this watershed because it covers parts of New Orleans an is near the Mississippi Delta. Deltas are boundary areas between the land and the ocean, and as a result tend to contain a rich variety of different land cover and land use types.

Write a 2-3 sentence site description (with citations) of this area that helps to put your analysis in context.

YOUR SITE DESCRIPTION HERE

STEP 3: MULTISPECTRAL DATA

Search for data

Try It
  1. Log in to the earthaccess service using your Earthdata credentials: python earthaccess.login(persist=True)
  2. Modify the following sample code to search for granules of the HLSL30 product overlapping the watershed boundary from May to October 2023 (there should be 76 granules): python results = earthaccess.search_data( short_name="...", cloud_hosted=True, bounding_box=tuple(gdf.total_bounds), temporal=("...", "..."), )
# Log in to earthaccess

# Search for HLS tiles
See our solution!
# Log in to earthaccess
earthaccess.login(persist=True)
# Search for HLS tiles
results = earthaccess.search_data(
    short_name="HLSL30",
    cloud_hosted=True,
    bounding_box=tuple(delta_gdf.total_bounds),
    temporal=("2024-06", "2024-08"),
)

Compile information about each granule

I recommend building a GeoDataFrame, as this will allow you to plot the granules you are downloading and make sure they line up with your shapefile. You could also use a DataFrame, dictionary, or a custom object to store this information.

Try It
  1. For each search result:
    1. Get the following information (HINT: look at the [‘umm’] values for each search result):
      • granule id (UR)
      • datetime
      • geometry (HINT: check out the shapely.geometry.Polygon class to convert points to a Polygon)
    2. Open the granule files. I recomment opening one granule at a time, e.g. with (earthaccess.open([result]).
    3. For each file (band), get the following information:
      • file handler returned from earthaccess.open()
      • tile id
      • band number
  2. Compile all the information you collected into a GeoDataFrame
# Loop through each granule

    # Get granule information

    # Get URL

    # Build metadata DataFrame rows

# Concatenate metadata DataFrame
See our solution!
def get_earthaccess_links(results):
    url_re = re.compile(
        r'\.(?P<tile_id>\w+)\.\d+T\d+\.v\d\.\d\.(?P<band>[A-Za-z0-9]+)\.tif')

    # Loop through each granule
    link_rows = []
    url_dfs = []
    for granule in tqdm(results):
        # Get granule information
        info_dict = granule['umm']
        granule_id = info_dict['GranuleUR']
        datetime = pd.to_datetime(
            info_dict
            ['TemporalExtent']['RangeDateTime']['BeginningDateTime'])
        points = (
            info_dict
            ['SpatialExtent']['HorizontalSpatialDomain']['Geometry']['GPolygons'][0]
            ['Boundary']['Points'])
        geometry = Polygon(
            [(point['Longitude'], point['Latitude']) for point in points])
        
        # Get URL
        files = earthaccess.open([granule])

        # Build metadata DataFrame
        for file in files:
            match = url_re.search(file.full_name)
            if match is not None:
                link_rows.append(
                    gpd.GeoDataFrame(
                        dict(
                            datetime=[datetime],
                            tile_id=[match.group('tile_id')],
                            band=[match.group('band')],
                            url=[file],
                            geometry=[geometry]
                        ),
                        crs="EPSG:4326"
                    )
                )

    # Concatenate metadata DataFrame
    file_df = pd.concat(link_rows).reset_index(drop=True)
    return file_df

Open, crop, and mask data

This will be the most resource-intensive step. I recommend caching your results using the cached decorator or by writing your own caching code. I also recommend testing this step with one or two dates before running the full computation.

This code should include at least one function including a numpy-style docstring. A good place to start would be a function for opening a single masked raster, applying the appropriate scale parameter, and cropping.

Try It
  1. For each granule:
    1. Open the Fmask band, crop, and compute a quality mask for the granule. You can use the following code as a starting point, making sure that mask_bits contains the quality bits you want to consider: ```python # Expand into a new dimension of binary bits bits = ( np.unpackbits(da.astype(np.uint8), bitorder=‘little’) .reshape(da.shape + (-1,)) )

      # Select the required bits and check if any are flagged mask = np.prod(bits[…, mask_bits]==0, axis=-1) ```

    2. For each band that starts with ‘B’:

      1. Open the band, crop, and apply the scale factor
      2. Name the DataArray after the band using the .name attribute
      3. Apply the cloud mask using the .where() method
      4. Store the DataArray in your data structure (e.g. adding a GeoDataFrame column with the DataArray in it. Note that you will need to remove the rows for unused bands)
# Loop through each image

    # Open granule cloud cover

    # Compute cloud mask

    # Loop through each spectral band

        # Open, crop, and mask the band

        # Add the DataArray to the metadata DataFrame row

    # Reassemble the metadata DataFrame
See our solution!
@cached('delta_reflectance_da_df')
def compute_reflectance_da(search_results, boundary_gdf):
    """
    Connect to files over VSI, crop, cloud mask, and wrangle
    
    Returns a single reflectance DataFrame 
    with all bands as columns and
    centroid coordinates and datetime as the index.
    
    Parameters
    ==========
    file_df : pd.DataFrame
        File connection and metadata (datetime, tile_id, band, and url)
    boundary_gdf : gpd.GeoDataFrame
        Boundary use to crop the data
    """
    def open_dataarray(url, boundary_proj_gdf, scale=1, masked=True):
        # Open masked DataArray
        da = rxr.open_rasterio(url, masked=masked).squeeze() * scale
        
        # Reproject boundary if needed
        if boundary_proj_gdf is None:
            boundary_proj_gdf = boundary_gdf.to_crs(da.rio.crs)
            
        # Crop
        cropped = da.rio.clip_box(*boundary_proj_gdf.total_bounds)
        return cropped
    
    def compute_quality_mask(da, mask_bits=[1, 2, 3]):
        """Mask out low quality data by bit"""
        # Unpack bits into a new axis
        bits = (
            np.unpackbits(
                da.astype(np.uint8), bitorder='little'
            ).reshape(da.shape + (-1,))
        )

        # Select the required bits and check if any are flagged
        mask = np.prod(bits[..., mask_bits]==0, axis=-1)
        return mask

    file_df = get_earthaccess_links(search_results)
    
    granule_da_rows= []
    boundary_proj_gdf = None

    # Loop through each image
    group_iter = file_df.groupby(['datetime', 'tile_id'])
    for (datetime, tile_id), granule_df in tqdm(group_iter):
        print(f'Processing granule {tile_id} {datetime}')
              
        # Open granule cloud cover
        cloud_mask_url = (
            granule_df.loc[granule_df.band=='Fmask', 'url']
            .values[0])
        cloud_mask_cropped_da = open_dataarray(cloud_mask_url, boundary_proj_gdf, masked=False)

        # Compute cloud mask
        cloud_mask = compute_quality_mask(cloud_mask_cropped_da)

        # Loop through each spectral band
        da_list = []
        df_list = []
        for i, row in granule_df.iterrows():
            if row.band.startswith('B'):
                # Open, crop, and mask the band
                band_cropped = open_dataarray(
                    row.url, boundary_proj_gdf, scale=0.0001)
                band_cropped.name = row.band
                # Add the DataArray to the metadata DataFrame row
                row['da'] = band_cropped.where(cloud_mask)
                granule_da_rows.append(row.to_frame().T)
    
    # Reassemble the metadata DataFrame
    return pd.concat(granule_da_rows)

reflectance_da_df = compute_reflectance_da(results, delta_gdf)
Processing granule T15RYN 2024-06-07 16:31:11.509000+00:00
Processing granule T15RYP 2024-06-07 16:31:11.509000+00:00
Processing granule T16RBT 2024-06-07 16:31:11.509000+00:00
Processing granule T16RBU 2024-06-07 16:31:11.509000+00:00
Processing granule T15RYN 2024-06-15 16:31:19.154000+00:00
Processing granule T15RYP 2024-06-15 16:31:19.154000+00:00
Processing granule T16RBT 2024-06-15 16:31:19.154000+00:00
Processing granule T16RBU 2024-06-15 16:31:19.154000+00:00
Processing granule T15RYN 2024-06-23 16:31:21.277000+00:00
Processing granule T15RYP 2024-06-23 16:31:21.277000+00:00
Processing granule T16RBT 2024-06-23 16:31:21.277000+00:00
Processing granule T16RBU 2024-06-23 16:31:21.277000+00:00
Processing granule T15RYN 2024-07-01 16:31:17.338000+00:00
Processing granule T15RYP 2024-07-01 16:31:17.338000+00:00
Processing granule T16RBT 2024-07-01 16:31:17.338000+00:00
Processing granule T16RBU 2024-07-01 16:31:17.338000+00:00
Processing granule T15RYN 2024-07-09 16:31:29.187000+00:00
Processing granule T15RYP 2024-07-09 16:31:29.187000+00:00
Processing granule T16RBT 2024-07-09 16:31:29.187000+00:00
Processing granule T16RBU 2024-07-09 16:31:29.187000+00:00
Processing granule T15RYN 2024-07-17 16:31:33.271000+00:00
Processing granule T15RYP 2024-07-17 16:31:33.271000+00:00
Processing granule T16RBT 2024-07-17 16:31:33.271000+00:00
Processing granule T16RBU 2024-07-17 16:31:33.271000+00:00
Processing granule T15RYN 2024-07-25 16:31:43.265000+00:00
Processing granule T15RYP 2024-07-25 16:31:43.265000+00:00
Processing granule T16RBT 2024-07-25 16:31:43.265000+00:00
Processing granule T16RBU 2024-07-25 16:31:43.265000+00:00
Processing granule T15RYN 2024-08-02 16:31:36.413000+00:00
Processing granule T15RYP 2024-08-02 16:31:36.413000+00:00
Processing granule T16RBT 2024-08-02 16:31:36.413000+00:00
Processing granule T16RBU 2024-08-02 16:31:36.413000+00:00
Processing granule T15RYN 2024-08-10 16:31:42.335000+00:00
Processing granule T15RYP 2024-08-10 16:31:42.335000+00:00
Processing granule T16RBT 2024-08-10 16:31:42.335000+00:00
Processing granule T16RBU 2024-08-10 16:31:42.335000+00:00
Processing granule T15RYN 2024-08-18 16:31:46.256000+00:00
Processing granule T15RYP 2024-08-18 16:31:46.256000+00:00
Processing granule T16RBT 2024-08-18 16:31:46.256000+00:00
Processing granule T16RBU 2024-08-18 16:31:46.256000+00:00
Processing granule T15RYN 2024-08-26 16:31:51.172000+00:00
Processing granule T15RYP 2024-08-26 16:31:51.172000+00:00
Processing granule T16RBT 2024-08-26 16:31:51.172000+00:00
Processing granule T16RBU 2024-08-26 16:31:51.172000+00:00

Merge and Composite Data

You will notice for this watershed that: 1. The raster data for each date are spread across 4 granules 2. Any given image is incomplete because of clouds

Try It
  1. For each band:

    1. For each date:

      1. Merge all 4 granules
      2. Mask any negative values created by interpolating from the nodata value of -9999 (rioxarray should account for this, but doesn’t appear to when merging. If you leave these values in they will create problems down the line)
    2. Concatenate the merged DataArrays along a new date dimension

    3. Take the mean in the date dimension to create a composite image that fills cloud gaps

    4. Add the band as a dimension, and give the DataArray a name

  2. Concatenate along the band dimension

# Merge and composite and image for each band

        # Merge granules for each date

        # Mask negative values

    # Composite images across dates
See our solution!
@cached('delta_reflectance_da')
def merge_and_composite_arrays(granule_da_df):
    # Merge and composite and image for each band
    df_list = []
    da_list = []
    for band, band_df in tqdm(granule_da_df.groupby('band')):
        merged_das = []
        for datetime, date_df in tqdm(band_df.groupby('datetime')):
            # Merge granules for each date
            merged_da = rxrmerge.merge_arrays(list(date_df.da))
            # Mask negative values
            merged_da = merged_da.where(merged_da>0)
            merged_das.append(merged_da)
            
        # Composite images across dates
        composite_da = xr.concat(merged_das, dim='datetime').median('datetime')
        composite_da['band'] = int(band[1:])
        composite_da.name = 'reflectance'
        da_list.append(composite_da)
        
    return xr.concat(da_list, dim='band')

reflectance_da = merge_and_composite_arrays(reflectance_da_df)
reflectance_da
<xarray.DataArray 'reflectance' (band: 10, y: 556, x: 624)> Size: 14MB
array([[[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
...
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]]], dtype=float32)
Coordinates:
  * x            (x) float64 5kB 7.926e+05 7.926e+05 ... 8.112e+05 8.113e+05
  * y            (y) float64 4kB 3.304e+06 3.304e+06 ... 3.287e+06 3.287e+06
    spatial_ref  int64 8B 0
  * band         (band) int64 80B 1 2 3 4 5 6 7 9 10 11

STEP 4: K-MEANS

Cluster your data by spectral signature using the k-means algorithm.

Try It
  1. Convert your DataArray into a tidy DataFrame of reflectance values (hint: check out the .to_dataframe() and .unstack() methods)
  2. Filter out all rows with no data (all 0s or any N/A values)
  3. Fit a k-means model. You can experiment with the number of groups to find what works best.
# Convert spectral DataArray to a tidy DataFrame

# Running the fit and predict functions at the same time.
# We can do this since we don't have target data.

# Add the predicted values back to the model DataFrame
See our solution!
# Convert spectral DataArray to a tidy DataFrame
model_df = reflectance_da.to_dataframe().reflectance.unstack('band')
model_df = model_df.drop(columns=[10, 11]).dropna()

# Running the fit and predict functions at the same time.
# We can do this since we don't have target data.
prediction = KMeans(n_clusters=6).fit_predict(model_df.values)

# Add the predicted values back to the model DataFrame
model_df['clusters'] = prediction
model_df
band 1 2 3 4 5 6 7 9 clusters
y x
3.303783e+06 810148.062907 0.01560 0.0225 0.0409 0.0366 0.0478 0.0281 0.0181 0.0006 5
810178.062907 0.01895 0.0256 0.0396 0.0413 0.0426 0.0284 0.0220 0.0006 5
810208.062907 0.01915 0.0246 0.0387 0.0377 0.0384 0.0273 0.0236 0.0007 5
810238.062907 0.02040 0.0247 0.0440 0.0445 0.0629 0.0418 0.0266 0.0007 0
810268.062907 0.01815 0.0245 0.0437 0.0444 0.0618 0.0397 0.0259 0.0008 0
... ... ... ... ... ... ... ... ... ... ...
3.287163e+06 793768.062907 0.02650 0.0345 0.0548 0.0427 0.0218 0.0098 0.0074 0.0007 5
793798.062907 0.02790 0.0351 0.0549 0.0439 0.0221 0.0104 0.0076 0.0008 5
793828.062907 0.02580 0.0331 0.0534 0.0419 0.0194 0.0080 0.0059 0.0009 5
793858.062907 0.02570 0.0326 0.0521 0.0402 0.0182 0.0064 0.0046 0.0007 5
793888.062907 0.02550 0.0340 0.0541 0.0423 0.0199 0.0083 0.0060 0.0007 5

317917 rows × 9 columns

STEP 5: PLOT

Try It

Create a plot that shows the k-means clusters next to an RGB image of the area. You may need to brighten your RGB image by multiplying it by 10. The code for reshaping and plotting the clusters is provided for you below, but you will have to create the RGB plot yourself!

So, what is .sortby(['x', 'y']) doing for us? Try the code without it and find out.

# Plot the k-means clusters
(
    rgb_plot
    + 
    model_df.clusters.to_xarray().sortby(['x', 'y']).hvplot(
        cmap="Colorblind", aspect='equal') 
)
See our solution!
rgb = reflectance_da.sel(band=[4, 3, 2])
rgb_uint8 = (rgb * 255).astype(np.uint8).where(rgb!=np.nan)
rgb_bright = rgb_uint8 * 10
rgb_sat = rgb_bright.where(rgb_bright < 255, 255)

(
    rgb_sat.hvplot.rgb( 
        x='x', y='y', bands='band',
        data_aspect=1,
        xaxis=None, yaxis=None)
    + 
    model_df.clusters.to_xarray().sortby(['x', 'y']).hvplot(
        cmap="Colorblind", aspect='equal') 
)
Reflect and Respond

Don’t forget to interpret your plot!

YOUR PLOT HEADLINE AND DESCRIPTION HERE